This presentation covers three topics: 1) a brief general description of sequence comparison; 2) 
a description of the development of a sequence comparator for phoneme-to-phoneme sentence 
alignment; and 3) a brief report on some results obtained with the comparator. Sequence comparison 
methods appear to have been discovered independently in several countries during the late 1960s and 
early 1970s (Kruskal, 1983). The goal of sequence comparison is to obtain an alignment between 
strings for which elements may have been deleted, inserted, or substituted (either as exact matches or 
replacements). To accomplish this goal, sequence comparison comprises two parts: the concept of 
distance between elements (costs) and algorithms to minimize total distance between strings. The 
current application is based on the description of sequence comparison in Sankoff and Kruskal (1983). 

My interest in sequence comparison arose in the context of research on sensory aids for 
profoundly deaf people. The main goal for these devices (such as tactile aids and cochlear implants) is 
to improve speech communication. Usually this means enhancing a subject's ability to lipread 
(speechread). In developing and testing sensory aids, it is desirable, therefore, to employ evaluation 
measures applied to connected speech. It is also desirable to employ a testing procedure that is simple 
and imposes few constraints on subjects. Asking a subject what had just been said seemed such a 
straightforward, simple approach. Traditionally, when open set identification of this kind is employed, 
results are scored in terms of words or keywords correct. However, we had observed responses to the 
task of lipreading with or without a sensory aid that contained few or no words correct and yet 
appeared to be phonetically similar to the stimulus. It was hypothesized that much could be learned by 
studying the patterns of errors in responses, were it possible to obtain a systematic means of 
phonemically aligning the stimulus with the response. 

The following stimulus-response sentence pair contains several common characteristics of 
errors from lipreading and problems that must be solved in generating alignments. 

Stimulus: Proofread your final results. 
Response: Blue fish are funny. 

The initial consonants /b/ and /p/ are visually similar, hence the predictable substitution. The /u/ in 
proof and blue are similar although spelled differently. It appears that a word boundary has been 
misparsed, such that the final /f/ in proof is identified as the initial /f/ in funny. Other correct phonemes 
occur in the stimulus and the response in roughly the same order and location, such as /rf/ in your final 
versus are funny, although the words do not match. The obtained sequence comparator (Bernstein et 
al., 1993a) solved the above alignment in the following manner (see footnote 1): 

Stimulus: pru tMd#yur#fAnL#r|z A lts 
Response: blu#f-IS#a-r#f A n-#-i- — 

AN EXTREMELY BRIEF INTRODUCTION TO SEQUENCE COMPARISON 

As mentioned above, sequence comparison has two main parts, metrics for measuring distance 
or similarity and algorithms for minimizing distance between sequences. The data submitted to the 
sequence comparator are: 

Stimulus sequence: $a = a_1 \ldots a_m$ 
Response sequence: $b = b_1 \ldots b_n$ 

The sequence comparator also needs costs for inserting a phoneme in a response, costs for deleting a 
phoneme from a stimulus, and costs for substitutions (exact matches or replacements). Like the data 
submitted to the sequence comparator, the costs are determined prior to initiating the alignment 
process. As a step in the solution to the alignment problem, a stimulus-response matrix is constructed 
whose cells are the entries $(a_i, b_j)$. Each of the cells is processed according to a recurrence algorithm 
whose goal is to obtain a minimum distance between sequences and a phoneme-to-phoneme alignment 
of sequences. The basic recurrence equation is: 



$$
d_{i,j} = \min
\begin{cases}
d_{i-1,\,j} + \text{deletion of } a_i \\
d_{i-1,\,j-1} + \text{substitution of } a_i \text{ with } b_j \\
d_{i,\,j-1} + \text{insertion of } b_j
\end{cases}
$$



The recurrence equation implies that cells with lower subscripts will be processed before cells with 
higher subscripts. At each cell, the minimization is used to decide whether the cell should result in an 
insertion, a deletion, or a substitution. When the final cell, $(a_m, b_n)$, has been processed, the value $d_{m,n}$ is 
the minimum distance between the sequences. A pointer equation corresponds to the recurrence 
equation: 



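$$
\text{pointer}(i, j) =
\begin{cases}
(i-1,\; j) & \text{OR} \\
(i-1,\; j-1) & \text{OR} \\
(i,\; j-1) & \text{if the corresponding term above is minimum.}
\end{cases}
$$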



For every cell that is processed, one or more pointers are generated to be used for constructing the 
alignment. Each of the three expressions on the right of the pointer equation corresponds respectively 
to the three alignment options in the recurrence equation. If more than one value is minimum in the 
recurrence equation, more than one pointer will be generated and subsequently a corresponding 
alignment. That is, several different minimal alignments can be generated. Pointers are followed 
beginning with the final cell that is processed, $(a_m, b_n)$. Figure 1 shows a fragment of a possible costs 
matrix; a stimulus-response matrix with minimal distances shown in the lower right-hand corner of each 
cell; and pointers corresponding to the manner in which the phonemes are to be aligned. (See Sankoff 
and Kruskal, 1983, for other examples using the same notation.) 

1 Note that the transcriptions in this paper are given in the DECtalk single-character notational system. 
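
The recurrence and pointer equations can be illustrated with a short Python sketch. It is not the comparator of Bernstein et al. (1993a): the function names, the constant insertion and deletion costs, and the unit substitution cost are assumptions made only for illustration.

```python
from typing import Callable, List, Tuple

GAP = "-"  # marks an inserted or deleted position in the printed alignment


def unit_cost(a: str, b: str) -> float:
    """Hypothetical substitution cost: 0 for an exact match, 1 for a replacement."""
    return 0.0 if a == b else 1.0


def align(stimulus: str, response: str,
          sub_cost: Callable[[str, str], float] = unit_cost,
          ins_cost: float = 1.0,
          del_cost: float = 1.0) -> Tuple[float, List[Tuple[str, str]]]:
    """Return the minimum distance and every minimal alignment."""
    m, n = len(stimulus), len(response)
    # d[i][j]: minimum distance between stimulus[:i] and response[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    # ptr[i][j]: all predecessor cells whose term attains the minimum
    ptr = [[[] for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
        ptr[i][0] = [(i - 1, 0)]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
        ptr[0][j] = [(0, j - 1)]
    # Cells with lower subscripts are processed before cells with higher ones.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            options = [
                (d[i - 1][j] + del_cost, (i - 1, j)),              # deletion
                (d[i - 1][j - 1]
                 + sub_cost(stimulus[i - 1], response[j - 1]),
                 (i - 1, j - 1)),                                  # substitution
                (d[i][j - 1] + ins_cost, (i, j - 1)),              # insertion
            ]
            d[i][j] = min(cost for cost, _ in options)
            ptr[i][j] = [cell for cost, cell in options if cost == d[i][j]]

    alignments: List[Tuple[str, str]] = []

    def walk(i: int, j: int, top: str, bottom: str) -> None:
        """Follow pointers back from the final cell to cell (0, 0)."""
        if i == 0 and j == 0:
            alignments.append((top, bottom))
            return
        for pi, pj in ptr[i][j]:
            if (pi, pj) == (i - 1, j - 1):
                walk(pi, pj, stimulus[i - 1] + top, response[j - 1] + bottom)
            elif pi == i - 1:
                walk(pi, pj, stimulus[i - 1] + top, GAP + bottom)
            else:
                walk(pi, pj, GAP + top, response[j - 1] + bottom)

    walk(m, n, "", "")
    return d[m][n], alignments
```

With these assumed unit costs, `align("bu", "blu")` returns a distance of 1.0 and the single alignment `b-u` / `blu`. When several terms tie for the minimum, every tied pointer is followed, so more than one alignment is returned.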

DEVELOPMENT OF A PHONEME-TO-PHONEME SEQUENCE COMPARATOR 

Since the aim of the work was to study lipread sentences, it was necessary to obtain a distance 
metric that applied to visual-phonetic similarity. The literature on lipreading contains numerous 
studies whose goal was to define a unit of visual-phonetic similarity known as the viseme . Visemes are 
visual equivalence classes of putatively noncontrasting phonemes (see Fisher, 1968; Owens and Blazek, 
1985). For example, /b p m/ are considered a viseme, because they are visually ambiguous according 
to typical criteria. An initial comparator used the simple recurrence equation above with a 
viseme-based costs matrix. Table 1 shows how costs were assigned (Bernstein et al., 1993a). 



Table 1. Costs of seven types of elementary alignments. 

Type of Elementary Alignment                            Example    Cost 
Exact match                                             a, a 
Substitution within a viseme group                      b, p, m 
Substitution within consonants, but across visemes      b, g 
Substitution within vowels, but across visemes          a, i 
Substitution of consonants for vowels and vice versa    a, b 
Insertion of a vowel or consonant in the response 
Deletion of a vowel or consonant in the stimulus 
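
The seven types of elementary alignment can be expressed as a small classification function. The Python sketch below is illustrative only. The numeric costs are hypothetical placeholders rather than the values used by the comparator, the vowel set is an assumed subset, and /b p m/ is the only viseme group taken from the text.

```python
from typing import Optional

# Only /b p m/ is named as a viseme group in the text; a full inventory
# would be needed in practice.
VISEME_GROUPS = [frozenset({"b", "p", "m"})]
VOWELS = frozenset({"a", "i", "u"})  # hypothetical, illustrative vowel set

# Hypothetical placeholder costs, NOT the values used by the comparator.
HYPOTHETICAL_COSTS = {
    "exact match": 0.0,
    "within viseme": 1.0,
    "consonants across visemes": 2.0,
    "vowels across visemes": 2.0,
    "consonant-vowel": 3.0,
    "insertion": 2.0,
    "deletion": 2.0,
}


def alignment_type(stim: Optional[str], resp: Optional[str]) -> str:
    """Classify one elementary alignment; None stands for a missing phoneme."""
    if stim is None and resp is None:
        raise ValueError("at least one phoneme is required")
    if stim is None:
        return "insertion"   # phoneme present only in the response
    if resp is None:
        return "deletion"    # phoneme present only in the stimulus
    if stim == resp:
        return "exact match"
    if any(stim in group and resp in group for group in VISEME_GROUPS):
        return "within viseme"
    stim_is_vowel, resp_is_vowel = stim in VOWELS, resp in VOWELS
    if stim_is_vowel != resp_is_vowel:
        return "consonant-vowel"
    if stim_is_vowel:
        return "vowels across visemes"
    return "consonants across visemes"


def cost(stim: Optional[str], resp: Optional[str]) -> float:
    """Look up the (hypothetical) cost for an elementary alignment."""
    return HYPOTHETICAL_COSTS[alignment_type(stim, resp)]
```

Under these assumptions, the example pairs in Table 1 fall into their respective categories: `alignment_type("b", "p")` gives "within viseme" and `alignment_type("b", "g")` gives "consonants across visemes".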





Evaluation of the comparator made use of data from 139 normal-hearing and normal-vision 
young adults who lipread the 100 CID Everyday Sentences (Davis and Silverman, 1970) recorded on 
video laserdisc. Subjects lipread each sentence, and then typed at a computer terminal what the talker 
had said. 12,291 responses were obtained. Responses were checked for spelling errors and were 
corrected whenever errors were unambiguously due to spelling. Each response sentence was then 
transcribed using DECtalk, a text-to-speech synthesizer that produces a quasi-phonemic transcription 
as one stage in its transcription process. Transcription errors were corrected. Then the transcribed 
stimulus-response sentence pairs were submitted to the sequence comparator. Alignments and various 
measures were obtained, such as minimum distance, number of phonemes correct, number of 
phonemes deleted, and number of phonemes inserted. 

Alignment Example 

Stimulus: b-u 
Response: blu 

Figure 1. Schematic representation of sequence comparison: a fragment of a 
hypothetical costs matrix, a stimulus-response matrix, and a diagram of the 
manner in which to interpret pointers. 

Unfortunately, the combination of the simple recurrence equation and the viseme-based costs 
matrix resulted in inadequate constraint over the alignments. Numerous alternate equal-distance 
alignments were generated. Figure 2 shows four such alternate alignments for one sentence pair. 
Subsequently, several modifications described in detail in Bernstein et al. (1993a) were implemented. 
A more complex algorithm was implemented that charged an extra cost for initiating strings of 
insertions or deletions, thus reducing the use of insertions and deletions and the consequent 
fragmentation of response words. Still, however, an unacceptably high number of equal-cost 
alignments was obtained, although some improvement was achieved. 

A costs matrix was then developed based on consonant confusions obtained in a nonsense 
syllable identification task. Multidimensional scaling was used to obtain Euclidean distances among 
consonants and among vowels. The new costs matrix provided additional resolution for reducing the 
number of equal distance alternate alignments. Unique alignments were obtained for 78% of 
stimulus-response pairs, and dual alignments (two alternate alignments) were obtained for 17% of 
pairs. A large proportion of the dual alignments involved a single elementary alignment, that is, one 
phoneme. The combination of the enhanced algorithm and the Euclidean distances was judged 
informally via inspection to be an adequate solution to the alignment problem. 

A VALIDATION EXPERIMENT 

A problem for validating the sequence comparator was the absence of independently generated 
alignments with which the comparator's performance could be compared. It was not possible to 
validate the comparator against human judgments, since only the comparator could be expected to 
systematically and reliably obtain alignments. A different tack was taken: an evaluation of whether the 
comparator was sensitive to whether stimulus-response pairs were true or randomly assigned (see footnote 2). A main 
question was whether a large number of phoneme-to-phoneme alignments would be obtained 
regardless of whether the response was paired with its true stimulus. 

The validation experiment used the same database of responses to CID Everyday Sentences as 
described above. One set of stimulus-response pairs were the true ones as collected in experimental 
sessions, and the other set were the randomly reassigned pairings. The results showed that exact 
phoneme matches were extremely rare in alignments of random pairs. 11,892 (96.7%) of randomly 
assigned sentence pairs resulted in five or fewer exact phoneme matches. 5,039 (41%) of true 
stimulus-response pairs resulted in six or more exact phoneme matches. Figure 3 shows the number of 
sentences as a function of number of phonemes correct for true and random pairs (Bernstein et al., 
1993a). 



2 Note that the selection of response for each of the stimuli was random in the case of the random pairs. 
The alignment procedure operated identically on the random and true pairs. 
Alternate Equal-Distance Alignments 

Stimulus: Here's a nice quiet place to rest. 
Response: That's the way it goes. 

Figure 2. Four alternate equal-distance alignments. 
Phoneme confusion matrices were constructed by extracting the individual 
phoneme-to-phoneme alignments from the database of alignments for true and randomly paired 
sentences. An information measure, substitution uncertainty in bits, was calculated for each phoneme 
in each of the databases as: 

$$-\sum_k p_k \log_2 p_k,$$

where $p_k$ is the proportion of responses in category k and k is an index of summation that represents 
each possible substitution error. Figure 4 shows average phoneme substitution uncertainty for 
phonemes in true and random sentence pairs. Uncertainty is always higher for the phonemes in random 
pairs. 
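
The substitution uncertainty measure can be sketched in Python as follows. The function name and the representation of alignments as (stimulus, response) phoneme pairs are assumptions. The sketch also assumes that the proportions are taken over replacement responses only, excluding exact matches and deletions.

```python
import math
from collections import Counter
from typing import Iterable, Tuple

GAP = "-"  # assumed symbol for a deleted or inserted position


def substitution_uncertainty(pairs: Iterable[Tuple[str, str]],
                             phoneme: str) -> float:
    """Entropy in bits of the replacement responses to one stimulus phoneme."""
    counts = Counter(
        resp for stim, resp in pairs
        if stim == phoneme and resp != phoneme and resp != GAP
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no substitution errors were observed for this phoneme
    # -sum(p_k * log2(p_k)), written with log2(1 / p_k) to avoid a negative zero
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

For example, if /b/ is replaced equally often by /p/ and /m/, the uncertainty is 1 bit. If it is always replaced by the same phoneme, the uncertainty is 0 bits, which is the sense in which lower values indicate more systematic substitutions.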

It was possible to examine also the confusion matrices in terms of more conventional 
transmitted information (TI) analyses using features. TI analysis uses the data in the entire confusion 
matrix, whereas substitution uncertainty considers only the off-diagonal entries. TI analysis was 
conducted using 12 features from Chomsky and Halle (1968) and two additional features, duration and 
frication, from Miller and Nicely (1955). Figure 5 shows proportion TI for the features that emerged 
as important when a sequential information analysis (Wang and Bilger, 1973) was applied to the two 
confusion matrices (see footnote 3). The figure shows that, in contrast with the data from alignments of true pairs, 
very little information was present in the confusion matrix from alignment of randomly assigned pairs. 
In summary, the results of the validation experiment suggest that the sequence comparator is sensitive 
to the nature of the data submitted to it. 

APPLICATION OF SEQUENCE COMPARISON TO A NORMATIVE STUDY OF 
LIPREADING IN DEAF VERSUS HEARING SUBJECTS 

Next, an experiment was conducted to determine whether measures from the sequence 
comparator are sensitive to subject population differences (Bernstein et al., 1993b). A normative study 
was conducted in which 96 adult subjects with normal hearing and 72 adult subjects with profound 
hearing impairment lipread nonsense syllables, isolated words, and sentences. The sentence stimuli of 
interest here were 50 of the CID Everyday Sentences. The lipreading task was as described above. 
The data preparation was the same as described previously. 

Figure 6 shows percent phonemes correct in sentences for the two groups. The figure shows 
that on average the deaf subjects were more accurate lipreaders than were the hearing subjects. This 
result is predictable, since percent words correct was approximately 40% for deaf and 20% for hearing 
subjects, a result that was obtained with the more conventional words correct scoring. The use of the 
sequence comparator does not actually contribute much to our understanding at this level of analysis, 
except to afford information about specific phonemes. The phoneme substitution uncertainty measure 
does contribute novel insight. Figure 7 shows phoneme substitution uncertainty obtained for each of 
the phonemes in alignments from the two subject groups. The figure shows that substitution 
uncertainty is higher for the hearing subjects for almost every phoneme. An interpretation of this result 
is that substitutions (i.e., replacements only) made by deaf subjects are more systematic than those 
made by hearing subjects. 

3 Note that the figure does not give conditional proportions TI, although selection of the features was 
based on sequential information analysis (Wang and Bilger, 1973). 

The more conventional TI analysis on the confusion matrices also showed a difference 
between subject groups. In Figure 8 it can be seen that the proportion TI was higher for the deaf 
subjects for each of the features. 

Since subjects had also performed forced-choice CV nonsense syllable identification for 22 
initial consonants, it was possible to compare TI across stimulus materials. Figure 9a-b shows results 
for nonsense syllables versus sentences for each of the subject groups. Note that the sentence data are 
the same as in Figure 8. Figure 9a-b shows that deaf subjects were more successful identifying 
nonsense syllables. Of more interest, however, is that 1) higher levels of TI were obtained with 
sentence materials than with nonsense syllables, and 2) somewhat different features emerged as 
important with the two different types of stimulus materials. The features high and consonantal were 
only important for nonsense syllable identification. The features nasal, vocalic, and voicing were only 
important for sentence identification. Since the visual phonetic stimulus does not afford all the featural 
distinctions (such as voicing), presence of these features in the sentence data can be attributed to the 
recognition of words. 

In summary, the sequence comparator was shown to be sensitive to subject group differences. 
Substitution uncertainty measures suggest there is a qualitative difference between subject populations. 
The capability to extract confusion matrices from alignments was shown useful in comparing TI 
analyses for nonsense syllable identification versus open set sentence identification. 

OTHER USES FOR ALIGNMENTS 

Sentence Histograms 

Another use for alignments appears in a paper by Demorest and Bernstein (1991). They 
introduce the sentence histogram, a figure showing performance accuracy on a phoneme-by-phoneme 
basis throughout a sentence (see Figure 9). Although we have not examined such histograms formally 
in great detail, informal study suggests that the perceptual process of visual speech perception may 
involve attempting to spot words that may become salient in the context of otherwise ambiguous or 
unintelligible speech. The reason for this hypothesis is that we have observed islands of correct word 
identifications embedded in otherwise incorrect responses (as in Figure 10). This hypothesis 
contradicts a common explanation for how lipreading is accomplished. That is, the lipreader is said to 
use context to recover unintelligible words. This explanation cannot account for identification of 
words in otherwise unintelligible contexts. Our observations have led to the hypothesis that lipreading 
is data driven to a far greater extent than has heretofore been thought. 

Word Boundary Detection 

A different phenomenon that can be observed in our data is failure to detect word boundaries. 
Consider the alignments in Figure 11 for responses to the stimulus, She'll only be gone a few minutes. 
These alignments were chosen because they provide evidence for systematic word boundary errors that 
may reflect effects of normal processes that contribute to segmentation. 

In the six examples, the /L/ in the stimulus word she'll is aligned with the word-initial 
consonant of a response word. It appears that the more prominently stressed /o/ in only has 
"captured" the preceding consonant. Later in the sentence, correct parsing is reestablished, likely 
due to the high visibility of be. 

Recently, Cutler and Butterfield (1992) described an experiment in which subjects listened 
to connected speech with controlled stress rhythm of strong and weak syllables, controlled lexical 
stress in terms of the location of stressed syllables within multisyllabic words, and controlled 
phonetic length of vowels. Subjects reported what they heard. Because the stimuli carefully 
controlled prosodic factors, it was possible to investigate systematic patterns of boundary shift 
errors hypothesized to be due to prosody. However, relatively few of the obtained responses 
could be evaluated, because the investigators did not have a method to make use of partial 
responses. Responses that did not explicitly reflect the number of syllables and the rhythmic 
pattern in the stimulus were rejected, and only 42% of the responses satisfied criteria. 
Nevertheless, support was obtained for the hypothesis that listeners use prosodic information to 
parse word boundaries. With sequence comparison, this type of interesting hypothesis could be 
more efficiently and elaborately investigated. 

CONCLUSIONS 

The experiments reported here demonstrate that sequence comparison can be applied to 
research on perception of connected speech. The sequence comparator produces several different 
numerical measures such as number of phonemes correct, number of insertions, deletions, and 
substitutions, and also phoneme-to-phoneme alignments. Alignments can be submitted to further 
analysis in which patterns of response are extracted. Although the data discussed here were from 
lipreading experiments, sequence comparison techniques can be applied to auditory and 
audio-visual connected speech as well as to visual speech perception. 
Assessment of individual differences in speech perception requires standardized tests that are 
sensitive to the relevant sources of variability in test scores (i.e., valid) and insensitive to irrelevant, 
extraneous sources of variability (i.e., reliable). Test reliability has traditionally been evaluated by 
examining extraneous sources of variability independently of one another. For example, retest 
reliability evaluates the consistency of test scores over time, with test occasion being the extraneous 
variable. Alternate-form reliability evaluates the consistency of scores over different test forms, with 
test form being the extraneous variable. Split-half reliability and internal consistency reliability evaluate 
consistency of performance over items within a single test form, and interscorer reliability reflects 
consistency across scorers. 

Generalizability theory (Cronbach, Gleser, Nanda, and Rajaratnam, 1972) is a statistical theory 
of sources of variability in behavioral observations that permits estimation of the effects of several 
extraneous variables, and their interactions, within a single experiment. A generalizability study is an 
experiment in which potential sources of variability in test scores are manipulated. A statistical model 
for a single observation and an analysis-of-variance model appropriate for the experimental design are 
specified. Next, expected values of the mean squares from the analysis of variance are determined and 
used to estimate the variance component for each source of variability in the observations. 

As an example, consider a study conducted by Demorest and Cord (1993) in which four 
monosyllabic word lists (NU-6) were administered on each of two days to a sample of 40 
hearing-impaired adults. The sources of variability were the test list and the test occasion. The 
statistical model for the score of one subject on a given list on a given day is: 

$$X = \mu + \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 + \alpha_6 + \varepsilon,$$

where $\mu$ is a grand mean, the $\alpha$ parameters represent the effects of Subject, List, Day, List x Day, 
Subject x List, and Subject x Day, and $\varepsilon$ is random, residual error. Given this model for a single score, 
the variance of observed scores is: 

$$\sigma_X^2 = \sigma_1^2 + \sigma_2^2 + \sigma_3^2 + \sigma_4^2 + \sigma_5^2 + \sigma_6^2 + \sigma_\varepsilon^2$$

Based on a paper submitted to the Journal of the Academy of Rehabilitative Audiology. This work 



The goal of the generalizability analysis is to estimate each of these variance components and their 
contribution to the total observed variance. Of all of these sources, only the first, that for Subject, 
reflects relevant variance; all other components are extraneous to the purposes of testing. 

The expected value of each mean square in the analysis of variance is a linear combination of 
the variance components. By equating each mean square to its expected value, estimates of the 
variance components can be obtained. For example, the expected value of the mean square for the 
interaction of Subject x Day, $MS_6$, is $4\sigma_6^2 + \sigma_\varepsilon^2$, whereas the expected value of the mean square for 
residual error, $MS_\varepsilon$, is $\sigma_\varepsilon^2$. Thus $[MS_6 - MS_\varepsilon]/4$ provides an estimate of $\sigma_6^2$. An algorithm for deriving 
the expected values of mean squares is given in Winer (1971). 

Analysis of the data from Demorest and Cord (1993) produces the following estimates for each 
component and for the total variance: 

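$$
\begin{aligned}
\hat{\sigma}_X^2 &= \hat{\sigma}_1^2 + \hat{\sigma}_2^2 + \hat{\sigma}_3^2 + \hat{\sigma}_4^2 + \hat{\sigma}_5^2 + \hat{\sigma}_6^2 + \hat{\sigma}_\varepsilon^2 \\
70.74 &= 57.34 + 0.32 + 0 + 0 + 1.52 + 4.84 + 6.73
\end{aligned}
$$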

The largest source of variability is Subject, accounting for 81.1% of the total variance of observed 
scores. Although List is a statistically significant (i.e., non-zero) effect, its magnitude is quite small. 
Effects for Day and the interaction of List x Day produce negative variance estimates, which have been 
set to zero. The interactions of Subject x List and Subject x Day, and the residual error (which 
represents the combined effects of all other sources of variability) account for 2.1%, 6.8%, and 9.5% of 
the variance, respectively. 
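
The arithmetic behind these estimates can be sketched in Python. The function names are assumptions. The mean squares in the usage example are not taken from the study: they are back-computed from the reported estimates (4 x 4.84 + 6.73 = 26.09) purely to illustrate the calculation. Percentages are taken against the reported total of 70.74, which differs from the sum of the rounded components by 0.01.

```python
from typing import Dict


def estimate_component(ms_effect: float, ms_error: float,
                       n_per_cell: int) -> float:
    """Solve E[MS_effect] = n * var_effect + var_error for var_effect.

    Negative estimates are set to zero, as described in the text.
    """
    return max(0.0, (ms_effect - ms_error) / n_per_cell)


# Variance component estimates as reported in the text.
REPORTED_COMPONENTS: Dict[str, float] = {
    "Subject": 57.34,
    "List": 0.32,
    "Day": 0.0,
    "List x Day": 0.0,
    "Subject x List": 1.52,
    "Subject x Day": 4.84,
    "Residual": 6.73,
}
REPORTED_TOTAL = 70.74  # total variance as reported in the text


def percent_of_total(components: Dict[str, float],
                     total: float) -> Dict[str, float]:
    """Express each variance component as a percentage of the total."""
    return {name: 100.0 * value / total for name, value in components.items()}
```

For instance, `estimate_component(26.09, 6.73, 4)` recovers the Subject x Day estimate of about 4.84, and `percent_of_total(REPORTED_COMPONENTS, REPORTED_TOTAL)["Subject"]` is about 81.1.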

Bilger, Nuetzel, Rabinowitz, and Rzeczkowski (1984) performed a generalizability analysis of 
the Speech Perception in Noise (SPIN) test in which several variables were manipulated. Given the 
large number of subjects in their study (N = 128), the large number of lists (10), and the multiple 
conditions of testing/scoring, they found that many irrelevant sources of variability were statistically 
significant. Their variance component analysis, however, revealed that the magnitude of many of these 
effects was trivial. Of particular importance was the finding that differences between methods of 
scoring (immediate write-down by the examiner versus transcription from a recording of the subject's 
response) were virtually zero. 

Generalizability analysis has also been used by Demorest, Bernstein, and Tucker (1993) to 
compare speechreading performance in two populations of subjects. Normal-hearing subjects (N = 96) 
and hearing-impaired subjects (N = 72) speechread 50 video-recorded CID Everyday Sentences, half 
spoken by a female talker and half spoken by a male talker. The unit of observation was the number of 
words correct on a single sentence. The effect of Group (normal vs. impaired hearing) accounted for 
7.4% of the variance in scores, while individual differences among subjects within the groups 
accounted for 19.0%. Individual test items accounted for 18.5% and residual error 51.1%. The 
remaining variance components combined accounted for only 4% of the total variance. The latter 
finding is important because it suggests that the interaction of Group x Item is a small effect. Thus, in 
developing speechreading materials (tests) for hearing-impaired individuals, it would be possible to use 
subjects with normal hearing because the relative difficulty of items is very similar for the two groups. 
Likewise, absence of a Group x Talker interaction implies that talker differences are the same within 
each population. 

When data for the hearing-impaired and normal-hearing groups were analyzed separately, the 
variance component for Subject was more than twice as large in the hearing-impaired group (2.14 vs. 
0.99). Differences among test items were more than three times as great (2.75 vs. 0.81). It appears 
that the wider range of speechreading ability in the hearing-impaired sample was also reflected in the 
mean performance on individual items. Other variance components, however, were very similar for the 
two groups. Variance attributable to Talker was essentially zero, as was the interaction of Subject x 
Talker, and residual error variances were 4.67 and 3.49 for the hearing-impaired and normal-hearing 
groups, respectively. 

Generalizability analysis yields coefficients of generalizability, which are analogous to reliability 
coefficients. Each coefficient is based on a data collection model for testing and a universe of 
generalization for test score interpretation. Together these determine which sources of variability affect 
observed scores and universe scores. (The latter are analogous to true scores in classical test theory.) 
The coefficient equals the ratio of universe-score variance to observed-score variance. For example, 
given the data from Demorest and Cord (1993) on NU-6 word lists, we might specify a data collection 
model as administration of a single list to a subject on a given day. Table 1 shows the estimated 
generalizability coefficient for four universes of generalization. Also shown are average reliability 
coefficients obtained from the same data. For example, the generalizability coefficient for 
generalization across lists, but not days, is analogous to an alternate-form reliability coefficient. The 
average of all the alternate-form reliability coefficients in these data gives a value virtually identical to 
the generalizability coefficient. Generalizability theory also makes it possible, however, to estimate 
immediate retest reliability (same list, same day, r = .904), even though no immediate retests were 
given. 



Table 1. Estimated Generalizability Coefficients for Four Universes of Generalization and Analogous 
Reliability Coefficients from Demorest and Cord (1993). 

Universe of Generalization           Estimated Generalizability    Mean Observed Reliability 
                                     Coefficient                   Coefficient 
Across Lists and Days                .814                          .808 
Across Lists for a Given Day         .883                          .877 
Across Days for a Given List         .836                          .832 
None: A Given List on a Given Day    .904 
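
The ratio of universe-score variance to observed-score variance can be illustrated with the variance component estimates reported earlier for these data. The Python sketch below rests on an assumption not stated in the text: that observed-score variance comprises the Subject component, the two Subject interactions, and residual error, with the List and Day main effects excluded. Under that assumption the ratios agree with the tabled coefficients to three decimal places. The function and argument names are illustrative.

```python
from typing import Iterable

# Variance component estimates reported for the NU-6 data.
COMPONENTS = {
    "Subject": 57.34,
    "Subject x List": 1.52,
    "Subject x Day": 4.84,
    "Residual": 6.73,
}


def g_coefficient(fixed_facets: Iterable[str] = ()) -> float:
    """Ratio of universe-score variance to observed-score variance.

    fixed_facets names the facets ("List", "Day") that are NOT generalized
    over; their interactions with Subject then count as universe-score
    variance. Assumes observed-score variance is the sum of COMPONENTS.
    """
    fixed = set(fixed_facets)
    universe = COMPONENTS["Subject"]
    if "List" in fixed:
        universe += COMPONENTS["Subject x List"]
    if "Day" in fixed:
        universe += COMPONENTS["Subject x Day"]
    observed = sum(COMPONENTS.values())
    return universe / observed
```

For example, `g_coefficient()` corresponds to generalization across lists and days, `g_coefficient(["Day"])` to generalization across lists for a given day, and `g_coefficient(["List", "Day"])` to a given list on a given day.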
Generalizability theory is especially useful for estimating the number of test items needed to 
achieve a particular level of generalizability. For example, Demorest and Bernstein (1992) presented 
speechreading data on 104 subjects with normal hearing who viewed 100 video-recorded CID 
Everyday Sentences, 50 for each of two talkers. The unit of observation was the subject's score on a 
single sentence scored in terms of total words correct. Generalizability coefficients were estimated for 
three models of data collection and generalization: 

Model 1: Test with a single talker; generalize over all test items by this talker. 

Model 2: Test with a single talker; generalize over all test items and both talkers. 

Model 3: Test some subjects with one talker, others with the other talker; generalize 
over all test items and both talkers. 

As can be seen in Figure 1 (adapted from Demorest and Bernstein, 1992), generalizability is highest for 
Model 1 and lowest for Model 3. All three functions, however, begin to plateau at about 30-40 items, 
suggesting that for these recordings of these materials, individual differences among subjects can be 
reliably estimated with about 40 sentences. 
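
The plateau can be understood from the way error variance shrinks as items are added. The Python sketch below is a simplified single-facet projection: it assumes that the error variance of a mean over n items is the per-item error variance divided by n, which corresponds most closely to Model 1. The variance components in the usage example are hypothetical, not values from Demorest and Bernstein (1992), and the talker-related components needed for Models 2 and 3 are not modeled.

```python
def projected_generalizability(subject_var: float,
                               error_var_per_item: float,
                               n_items: int) -> float:
    """Projected coefficient for a score averaged over n_items items."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return subject_var / (subject_var + error_var_per_item / n_items)


# Hypothetical components chosen only to show the shape of the function.
HYPOTHETICAL_SUBJECT_VAR = 1.0
HYPOTHETICAL_ERROR_VAR = 4.0

curve = {
    n: projected_generalizability(HYPOTHETICAL_SUBJECT_VAR,
                                  HYPOTHETICAL_ERROR_VAR, n)
    for n in (1, 10, 20, 40, 100)
}
```

With these hypothetical values the coefficient rises from 0.20 for a single item to about 0.91 for 40 items, but gains only about 0.05 more when the test is lengthened to 100 items.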

Generalizability theory provides an integrated framework for evaluating multiple sources of 
variability in behavioral observations and for deriving implications for test development and test score 
interpretation. It has only recently begun to be applied in the domain of speech perception, but as the 
examples presented here illustrate, it can provide valuable insights about individual differences, both 
within and between normal-hearing and hearing-impaired populations. 